Each record in the database describes a Boston suburb or town.
I'll evaluate the algorithms' performance using RMSE and R2 metrics.
RMSE gives a rough sense of how far off the predictions are overall (0 is perfect), and R2 indicates how well the model fits the data (1 is perfect, 0 is worst).
My best-performing model, a tuned SVM, had an RMSE of 2.81 on an unseen validation set. The second best was a tuned Cubist model with an RMSE of 2.90 on a validation set.
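For reference, both metrics are simple to compute by hand. The snippet below is only an illustrative sketch on made-up numbers, not part of the modelling pipeline; the caret and MLmetrics helpers used later compute the same quantities.
# illustrative sketch on made-up numbers: how RMSE and R2 are computed
actual    <- c(200, 150, 320, 275)
predicted <- c(210, 140, 300, 280)
sqrt(mean((actual - predicted)^2))                               # RMSE: 0 is perfect
1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2) # R2: 1 is perfect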
In [111]:
# first install packages devtools and pacman manually
#pull in source functions from github
devtools::source_url('https://raw.githubusercontent.com/jsphyg/Machine_Learning_Notebooks/master/myRfunctions.R')
#source("C:\\Work\\myRfunctions.R")
fnRunDate()
fnInstallPackages()
In [112]:
# import data changing any blanks and string NAs to NAs
dataset <- read_csv("C:\\Work\\kaggle_house_prices\\train.csv", na = c("","NA"))
test <- read_csv("C:\\Work\\kaggle_house_prices\\test.csv", na = c("","NA"))
head(dataset)
head(test)
In [113]:
Hmisc::describe(dataset, listunique=1)
In [114]:
psych::describe(dataset, check = T, skew = TRUE, ranges = TRUE, quant = TRUE)
In [115]:
fnMissingDataPercent(data = dataset)
In [116]:
# combine test and train.
dataset <- dplyr::bind_rows(dataset, test)
In [117]:
# quickly replace NAs. if numeric, replace with -1, if character replace with 'unknown'
# this gets rid of all NAs
dataset <- dataset %>% mutate_if(is.numeric, ~ ifelse(is.na(.), -1, .))
dataset <- dataset %>% mutate_if(is.character, ~ ifelse(is.na(.), 'unknown', .))
In [118]:
# combine the data to create dummy variables with caret. if data is split in train and test, i get errors
#create dummy variables
dmy <- caret::dummyVars(" ~ .", data = dataset, fullRank = T)
dataset <- as_tibble(predict(dmy, newdata = dataset))
#make the names usable in R
names(dataset) <- make.names(names(dataset), unique = TRUE)
dim(dataset)
head(test)
str(dataset)
In [119]:
# split back out the dataset and test
test <- dplyr::filter(dataset, Id > 1460)
dataset <- dplyr::filter(dataset, Id <= 1460)
#drop the test set's target column, which is full of NAs
test$SalePrice <- NULL
dim(dataset)
dim(test)
In [120]:
#set aside a final full set of data to train on before splitting a validation set
final_dataset <- dataset
# split a validation dataset
validation_index <- createDataPartition(dataset$SalePrice, p=0.80, list=FALSE)
validation <- dataset[-validation_index,]
dataset <- dataset[validation_index,]
In [121]:
dataset_slice <- dplyr::select(dataset, everything()) %>% dplyr::slice(., 1:100)
In [122]:
formula <- SalePrice ~ .
In [134]:
# Ensemble Methods
ds <- dataset_slice
# try ensembles
control <- trainControl(method="cv", number=10)
metric <- "RMSE"
# Random Forest
set.seed(9)
fit.rf <- train(formula, data=ds, method="rf", preProc=c("medianImpute"), metric=metric, trControl=control, na.action = na.pass)
# Stochastic Gradient Boosting
set.seed(9)
fit.gbm <- train(formula, data=ds, method="gbm", preProc=c("medianImpute"),metric=metric, trControl=control, verbose=FALSE, na.action = na.pass)
# Cubist
set.seed(9)
fit.cubist <- train(formula, data=ds, method="cubist", preProc=c("medianImpute"),metric=metric, trControl=control, na.action = na.pass)
# xgb
set.seed(9)
fit.xgb <- train(formula, data=ds, method="xgbTree", preProc=c("medianImpute"),metric=metric, trControl=control, na.action = na.pass)
# Compare algorithms
ensemble_results <- resamples(list(RF=fit.rf, GBM=fit.gbm, CUBIST=fit.cubist, XGB=fit.xgb))
summary(ensemble_results)
bwplot(ensemble_results)
In [135]:
fit.cubist
plot(fit.cubist)
Cubist was best. First, use a random search to find better tuning values, then tune around them with a grid search.
In [136]:
# Tune the Cubist algorithm
ds <- dplyr::select(dataset, everything()) %>% dplyr::slice(., 1:1000)
control <- trainControl(method="cv", number=10, search='random')
metric <- "RMSE"
set.seed(7)
rand.cubist <- train(formula, data=ds, method="cubist", metric=metric, trControl=control, tuneLength = 20)
print(rand.cubist)
plot(rand.cubist)
In [142]:
# Tune the Cubist algorithm
control <- trainControl(method="cv", number=5)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.committees=seq(5, 40, by=5), .neighbors=8)
tune.cubist <- train(formula, data=dataset, method="cubist", preProc=c("zv","medianImpute","BoxCox"),metric=metric, tuneGrid=grid, trControl=control, na.action = na.pass)
print(tune.cubist)
plot(tune.cubist)
In [137]:
set.seed(13)
predictions <- predict(rand.cubist, newdata=validation, na.action=na.pass)
MLmetrics::RMSE(predictions, validation$SalePrice)
MLmetrics::RMSLE(predictions, validation$SalePrice)
In [138]:
# make predictions on the test set for Kaggle submission
test$prediction <- predict(rand.cubist, newdata = test, na.action = na.pass)
head(test)
nrow(data.frame(test))
In [139]:
my_solution <- dplyr::select(test, Id = Id, SalePrice = prediction)
my_solution$Id <- as.character(my_solution$Id)
readr::write_csv(x = data.frame(my_solution), path = "C:\\Work\\my_solution.csv")
head(my_solution, n=5)
tail(my_solution, n=5)
In [41]:
# correlation between results
modelCor(ensemble_results)
splom(ensemble_results)
Cubist was the most accurate, with the lowest RMSE. I'll tune Cubist and see if I can get more out of it.
Cubist has two parameters that are tunable with caret: committees, the number of boosting iterations, and neighbors, which is used during prediction and sets the number of training instances used to adjust the rule-based prediction (although the documentation is perhaps a little ambiguous on this).
For more information about Cubist, see the function help with ?cubist. Let's first look at the default tuning parameters used by caret that produced our accurate model.
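To make the split between the two parameters concrete, here is a minimal sketch using the Cubist package directly (illustrative values only, not a tuned model): committees is fixed when the model is fit, while neighbors is supplied at prediction time.
# illustrative sketch: committees at fit time, neighbors at predict time
library(Cubist)
x  <- dplyr::select(dataset, -SalePrice)
y  <- dataset$SalePrice
cb <- cubist(x = x, y = y, committees = 20)    # 20 boosted rule-based models
p  <- predict(cb, newdata = x, neighbors = 5)  # adjust each prediction using the 5 nearest training cases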
In [42]:
print(fit.cubist)
We can see that the best RMSE was achieved with committees = 20 and neighbors = 5. Let's use a grid search to tune further; the grid below tries all committees between 90 and 105 and spot-checks neighbors values of 8 and 9.
In [50]:
# Tune the Cubist algorithm
control <- trainControl(method="cv", number=10)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.committees=seq(90, 105, by=1), .neighbors=seq(8,9, by=1))
tune.cubist <- train(formula, data=dataset, method="cubist", preProc=c("medianImpute"),metric=metric, tuneGrid=grid, trControl=control, na.action = na.pass)
print(tune.cubist)
plot(tune.cubist)
In [45]:
# Tune the Cubist algorithm
control <- trainControl(method="cv", number=10, search='random')
metric <- "RMSE"
set.seed(13)
rand.cubist <- train(formula, data=dataset, method="cubist", preProc=c("medianImpute"), metric=metric, trControl=control, tuneLength = 20, na.action = na.pass)
In [46]:
print(rand.cubist)
plot(rand.cubist)
In [47]:
set.seed(13)
predictions <- predict(rand.cubist, newdata=validation, na.action=na.pass)
MLmetrics::RMSE(predictions, validation$SalePrice)
MLmetrics::RMSLE(predictions, validation$SalePrice)
In [48]:
# make predictions on the test set for Kaggle submission
test$prediction <- predict(rand.cubist, newdata = test, na.action = na.pass)
head(test)
nrow(data.frame(test))
In [49]:
my_solution <- dplyr::select(test, Id = Id, SalePrice = prediction)
readr::write_csv(x = data.frame(my_solution), path = "C:\\Work\\my_solution.csv")
head(my_solution, n=20)
tail(my_solution, n=20)
In [56]:
# lm
set.seed(7)
fit.lm1 <- train(formula, data=dataset, method="lm", metric=metric, trControl=control)
# GLM
set.seed(7)
fit.glm1 <- train(formula, data=dataset, method="glm", metric=metric, trControl=control)
# GLMNET
set.seed(7)
fit.glmnet1 <- train(formula, data=dataset, method="glmnet", metric=metric, trControl=control)
# SVM
set.seed(7)
fit.svm1 <- train(formula, data=dataset, method="svmRadial", metric=metric, trControl=control)
# CART
set.seed(7)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart1 <- train(formula, data=dataset, method="rpart", metric=metric, tuneGrid=grid, trControl=control)
# kNN
set.seed(7)
fit.knn1 <- train(formula, data=dataset, method="knn", metric=metric, trControl=control)
# Compare algorithms
results <- resamples(list(LM1=fit.lm1, GLM1=fit.glm1, GLMNET1=fit.glmnet1, SVM1=fit.svm1, CART1=fit.cart1, KNN1=fit.knn1))
summary(results)
dotplot(results)
In [57]:
# correlation between results
modelCor(results)
splom(results)
Are high correlations among features preventing a better prediction?
In [58]:
# remove correlated attributes
# find attributes that are highly correlated
set.seed(7)
cutoff <- 0.70
correlations <- cor(dataset[,1:13])
highlyCorrelated <- findCorrelation(correlations, cutoff=cutoff)
for (value in highlyCorrelated) {
print(names(dataset)[value])
}
# create a new dataset without the highly correlated features
dataset_features <- dataset[,-highlyCorrelated]
dim(dataset_features)
Four features were dropped for being more than 70% correlated. Now I'll run the baseline algorithms again on the reduced dataset.
In [59]:
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "RMSE"
# lm
set.seed(7)
fit.lm <- train(medv~., data=dataset_features, method="lm", metric=metric, preProc=c("center", "scale"), trControl=control)
# GLM
set.seed(7)
fit.glm <- train(medv~., data=dataset_features, method="glm", metric=metric, preProc=c("center", "scale"), trControl=control)
# GLMNET
set.seed(7)
fit.glmnet <- train(medv~., data=dataset_features, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=control)
# SVM
set.seed(7)
fit.svm <- train(medv~., data=dataset_features, method="svmRadial", metric=metric, preProc=c("center", "scale"), trControl=control)
# CART
set.seed(7)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=dataset_features, method="rpart", metric=metric, tuneGrid=grid, preProc=c("center", "scale"), trControl=control)
# kNN
set.seed(7)
fit.knn <- train(medv~., data=dataset_features, method="knn", metric=metric, preProc=c("center", "scale"), trControl=control)
# Compare algorithms
feature_results <- resamples(list(LM=fit.lm, GLM=fit.glm, GLMNET=fit.glmnet, SVM=fit.svm, CART=fit.cart, KNN=fit.knn))
summary(feature_results)
dotplot(feature_results)
In [60]:
fit.svm
Removing highly-correlated predictors helped.
Now I'll tune the SVM algorithm.
The C parameter is the cost constraint used by the SVM. Learn more in the help for the ksvm function, ?ksvm. We can see from previous results that a C value of 1.0 is a good starting point, so let's design a grid search around it. There may be a small trend of decreasing RMSE with increasing C, so let's try all integer C values between 1 and 15. Another parameter that caret lets us tune is sigma. This is a smoothing parameter for the radial kernel. Good sigma values often start around 0.1, so we will try numbers below and above that.
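As a point of reference, here is roughly how those two parameters appear in a direct kernlab::ksvm() call on the reduced feature set; this is a sketch with placeholder values, not tuned results.
# illustrative sketch of the kernlab call that caret wraps for method = "svmRadial"
library(kernlab)
svm_direct <- ksvm(medv ~ ., data = dataset_features,
                   type   = "eps-svr",          # epsilon regression
                   kernel = "rbfdot",           # radial basis kernel
                   kpar   = list(sigma = 0.1),  # kernel smoothing parameter
                   C      = 1)                  # cost constraint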
In [73]:
# tune SVM sigma and C parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 15, by=1))
fit.svm <- train(formula, data=dataset, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("BoxCox"), trControl=control)
print(fit.svm)
plot(fit.svm)
We can see that the RMSE curves for the different sigma values flatten out as the C cost constraint grows. It looks like we might do well with a sigma of 0.05 and a C of 8, which gives a respectable RMSE of 2.977085. If we wanted to take this further, we could do finer tuning with additional grid searches, explore other parameters of the underlying ksvm() function, or, as already mentioned, run grid searches on the other nonlinear regression methods.
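For example, a finer follow-up grid around sigma = 0.05 and C = 8 might look like the sketch below (fine_grid and fit.svm.fine are hypothetical names; this is not a run result).
# sketch only: a finer grid around the best values from the previous search
fine_grid <- expand.grid(.sigma = seq(0.04, 0.06, by = 0.005), .C = seq(6, 10, by = 1))
fit.svm.fine <- train(formula, data = dataset, method = "svmRadial", metric = metric,
                      tuneGrid = fine_grid, preProc = c("BoxCox"), trControl = control)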
In [74]:
set.seed(13)
predictions <- predict(fit.svm, newdata=validation, na.action=na.pass)
MLmetrics::RMSE(predictions, validation$medv)
In [63]:
glimpse(dataset_features)
In [68]:
# apply spatialSign
centerScale <- caret::preProcess(dataset_features[,1:9], method = c(
"center","scale",
"spatialSign" #needs to be scaled first
))
ssData <- predict(centerScale, newdata = dataset_features)
In [69]:
head(ssData)
In [71]:
# tune SVM sigma and C parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15, 0.2, 0.25), .C=seq(1, 15, by=1))
fit.svm <- train(formula, data=ssData, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("YeoJohnson"), trControl=control)
print(fit.svm)
plot(fit.svm)
In [72]:
set.seed(13)
predictions <- predict(fit.svm, newdata=validation, na.action=na.pass)
caret::RMSE(predictions, validation$medv)
MLmetrics::RMSE(predictions, validation$medv)